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Abstract 

Background: It is well known that the development of cancer is caused by the accumulation of somatic mutations 
within the genome. For oncogenes specifically, current research suggests that there is a small set of "driver" mutations 
that are primarily responsible for tumorigenesis. Further, due to recent pharmacological successes in treating these 
driver mutations and their resulting tumors, a variety of approaches have been developed to identify potential driver 
mutations using methods such as machine learning and mutational clustering. We propose a novel methodology 
that increases our power to identify mutational clusters by taking into account protein tertiary structure via a graph 
theoretical approach. 

Results: We have designed and implemented GraphPAC (Graph Protein Amino acid Clustering) to identify 
mutational clustering while considering protein spatial structure. Using GraphPAC, we are able to detect novel clusters 
in proteins that are known to exhibit mutation clustering as well as identify clusters in proteins without evidence of 
prior clustering based on current methods. Specifically, by utilizing the spatial information available in the Protein 
Data Bank (PDB) along with the mutational data in the Catalogue of Somatic Mutations in Cancer (COSMIC), GraphPAC 
identifies new mutational clusters in well known oncogenes such as EGFR and KRAS. Further, by utilizing graph theory 
to account for the tertiary structure, GraphPAC discovers clusters in DPP4, NRP1 and other proteins not identified by 
existing methods. The R package is available at: http://bioconductor.org/packages/release/bioc/html/GraphPAC.html. 

Conclusion: GraphPAC provides an alternative to iPAC and an extension to current methodology when identifying 
potential activating driver mutations by utilizing a graph theoretic approach when considering protein tertiary 
structure. 



Background 

Cancer, one of the most widespread and heterogeneous 
diseases, is at its most fundamental level a disease 
brought on by the accumulation of somatic mutations 
[1]. These mutations typically occur in either tumor 
suppressors or oncogenes. While oncogenic mutations 
either tend to deregulate or up-regulate the result- 
ing protein behavior, mutations within tumor suppres- 
sors typically lower the activity of genes that prevent 
cancer. Pharmacological intervention has shown to be 
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more effective with inhibiting activating oncogenes than 
with restoring functionality of tumor suppressing genes. 
Combined with the theory of "oncogene addiction", 
that many cancers are dependent upon a small set of 
key genes to drive their rapid cellular multiplication 
with the rest of the mutations simply being passenger 
mutations [2,3], the identification of driver oncogenic 
mutations has become of critical importance in cancer 
research. 

Due to the importance of this problem, several approaches 
have been proposed to detect naturally selected regions in 
which activating mutations occur. One general approach 
postulates that driver mutations will have a higher non- 
synonymous mutation rate as compared to the back- 
ground level after normalizing for the length of the 
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gene [4-6]. Similarly, assuming that the neutral rate of 
nucleotide substitution is surpassed when positive selec- 
tion is acting on a specific region, one can check if the 
ratio of nonsynonymous (K a ) to synonymous (K s ) muta- 
tions per site is greater than 1 [7]. Relatedly, Ye et al 
[8] and Ryslik et al. [9] showed that mutational clus- 
ters can be indicative of activating mutations and that 
finding such clusters is a way to reduce the driver muta- 
tion search space needing to be analyzed. An alternative 
approach relies on creating classifiers to categorize muta- 
tions. Machine learning algorithms such as Polyphen-2 
[10], which predicts whether a missense mutation is dam- 
aging, and CHASM [11-13], which discriminates between 
known driver mutations and a set of passenger muta- 
tions, rely upon a set of rules developed using a variety 
of machine learning techniques such as Random Forests 
[14] and Support Vector Machines [15]. These rules can 
be used to calculate a score for each mutation based 
upon both sequence and non-sequence-based features 
such as evolutionary conservation, size and polarity of 
the substituted residue as well as accessible surface area 
[16]. Other classifiers, such as SIFT [17], use only a sub- 
set of these features, e.g. evolutionary conservation, for 
prediction. 

While the methods based upon background mutational 
rates have had some success in identifying regions of 
positive selections or driver mutations, they nonetheless 
suffer from several shortcomings. First, many of these 
methods rely upon calculating the difference between syn- 
onymous and non- synonymous mutations but do not take 
into account that selection can act upon minute regions 
of the gene. Thus, when the mutations rates are averaged 
over the entire gene, the signal may be lost. Second, the 
methods proposed by Kreitman [7] and Wang [4] do not 
differentiate between activating gain-of-function muta- 
tions and inactivating loss-of-function non-synonymous 
mutations. Third, many of the machine learning meth- 
ods require an extensive rule set that must first be trained 
using a well annotated database that is still limited. Until 
the requisite literature and information is developed, the 
machine learning algorithm is unable to create a well- 
performing classifier. Furthermore, the rules must be 
updated periodically to reflect updated knowledge and 
information. For a recent review of several popular meth- 
ods that attempt to discern missense substitution effect on 
protein function see Gnad et al. [18] and Gonzalez-Perez 
etal [19]. 

Building on the work of Bardelli et al [5] and Torkamani 
and Schork [20], which stipulated that only a small num- 
ber of specific mutations can activate a protein, Ye et al. [8] 
developed Non-Random Mutational Clustering (NMC) to 
identify potential activating mutations. NMC works on 
the hypothesis that absent any previously known muta- 
tional hotspot, a mutational cluster is indicative of a 



possible activating mutation. This is based on the observa- 
tion that most amino acid substitutions are either neutral 
or incompatible with protein function, resulting in a con- 
centration of activating mutations within a small subset 
of protein residues and domains [8]. For the null hypoth- 
esis that mutation locations are random in the candidate 
protein when represented in linear form, NMC identifies 
clustering by evaluating whether there is statistical evi- 
dence of mutations occurring closer together on the line 
than expected by chance. While NMC is able to impli- 
cate some cancer related genes, it is limited by the fact 
that it considers the protein as a linear sequence and 
does not take into account the tertiary protein struc- 
ture. To account for protein structure information, Ryslik 
etal. [9] developed iPAC (identification of Protein Amino 
acid Clustering), which reorganizes the protein into a one 
dimensional space that preserves, as best as possible, the 
three dimensional amino acid pairwise distances using 
Multidimensional Scaling (MDS) [21]. As described by 
Ryslik et al. [9], utilizing the tertiary information is crit- 
ical when identifying clustering as mutations that occur 
far apart when the protein is considered linearly can be 
very close together once the protein is folded in 3D space. 
The 3D proximity of such mutations might thus yield 
novel clusters. While it was shown that iPAC provides 
an improvement over NMC, the reliance upon a global 
method like MDS can potentially result in a distorted 
rearrangement of the protein, since distant residues will 
nevertheless have an impact on each others final position 
in one dimensional space. 

In this manuscript, we provide an alternative method 
to iPAC by remapping the protein into one dimensional 
space via a graph theoretic approach. This approach 
allows for a more natural consideration of the protein, 
one that is sensitive to protein domains and linkers. We 
show that our methodology is effective in identifying pro- 
teins with mutational clustering that are missed by both 
iPAC and NMC such as NRP1 and MAPK24. We also 
show that for some proteins, GraphPAC identifies fewer 
clusters than inferred by both iPAC and NMC while for 
other proteins GraphPAC identifies more clusters than the 
other two methods. While both GraphPAC and iPAC are 
an improvement over NMC since they account for ter- 
tiary structure, the differences between GraphPAC and 
iPAC point to the fact that different rearrangements of the 
protein must be considered in order to better understand 
the mutational clustering landscape. We show that many 
of the clusters identified by GraphPAC are also classified 
as damaging by Polyphen-2 and as an activating muta- 
tion by CHASM. By providing a more complete picture 
of mutational clustering than iPAC or NMC individually, 
GraphPAC allows us to obtain a more accurate landscape 
of where potential activating mutations may occur on the 
protein. 
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Methods 

GraphPAC uses a four step approach to identifying 
mutational clusters. The first step, as described in 
Sections 'Obtaining mutational data and 'Obtaining the 
3D structural data', retrieves mutational and positional 
data from COSMIC [22] and the PDB [23], respectively. 
After reconciling the mutational and positional databases 
(Section 'Reconciling the structural and mutational data'), 
the residues are realized as a connected graph where 
each residue is a vertex whereupon the traveling salesman 
problem is heuristically solved in order to find the short- 
est path through the protein (Section 'Traveling salesman 
approach'). Once the shortest path has been identified, the 
protein residues are reordered along this path providing a 
one dimensional ordering of the protein. The linear NMC 
algorithm is then used to calculate which mutations are 
closer together than expected by chance. Lastly, the clus- 
ters are unmapped back into the original space and the 
results reported back to the user. We detail each of the 
steps in the sections below. 

Obtaining mutational data 

The mutational positions were obtained from the 58th 
version of the COSMIC database that was downloaded via 
the following ftp site: ftp.sanger.ac.uk/pub/CGP/cosmic. 
The database was implemented locally using Oracle llg. 
Only missense mutations that were classified as "Con- 
firmed somatic variant" or "Reported in another cancer 
sample as somatic" were selected, with nonsense and 
synonymous mutations excluded. Moreover, we only con- 
sidered mutations originating from studies that were clas- 
sified as whole gene screens. Next, since multiple studies 
can report mutational data from the same cell line, muta- 
tional redundancies were removed to avoid double count- 
ing the mutations. Lastly, only the proteins with a UniProt 
Accession Number [24] were kept in order to correctly 
match the mutational and positional data, resulting in 777 
proteins. See "COSMIC query" in Additional file 1 for the 
SQL code required to generate the mutational data. 

Obtaining the 3D structural data 

The PDB web interface was used to obtain the protein ter- 
tiary information for each of the 777 proteins described in 
Section 'Obtaining mutational data. Since multiple struc- 
tures are often available for the same protein, all structures 
with a matching UniProt Accession Number were used 
and an appropriate multiple comparisons adjustment (see 
Section 'Multiple comparison adjustment for structures') 
was performed afterwards. For proteins where the resolu- 
tion provided alternative conformations, the first confor- 
mation listed in the file was used. Similarly, for structures 
where more than one polypeptide chain with a matching 
Uniprot Accession Number was available, the first match- 
ing chain listed in the file was used (typically chain A). 



Finally, after the chain and conformation were selected, 
the cartesian coordinates of all the a -carbon atoms were 
used to represent the tertiary backbone structure of the 
protein. While we only used the a -carbon location to rep- 
resent the residue location in this paper, our methods are 
robust if any of the other backbone atoms are used includ- 
ing the amide nitrogen, main chain carbonyl carbon or the 
main chain carbonyl oxygen. 

Also, while X-ray crystallography was used to determine 
many of the tertiary structures in the PDB, we note that 
molecular dynamics (MD) could in principle be used to 
model the protein structure in solution. However, taking 
into account the time complexity of such an approach 
for larger proteins as well as the number of structures 
that we consider, such a task is beyond the scope of 
this paper [25]. Further, as crystal structures are almost 
always representative of the correctly folded protein, using 
the current structural information is more than suffi- 
cient until MD simulations can be applied on much faster 
time scales. See "Structure Files" in Additional file 2 for 
a full listing of all the 1,904 structure/chain combinations 
used. 

Reconciling the structural and mutational data 

In order to reference the same residue in the COSMIC and 
PDB databases, an alignment was performed to accom- 
modate their different numbering systems. Like iPAC, 
GraphPAC allows two such reconciliations. The first is 
based upon a pairwise alignment as described in Pages 
et al. [26] while the second is based upon a numerical 
reconstruction from the structural information available 
in the PDB file. Due to the fact that the PDB file struc- 
ture potentially changes depending upon the structure 
release date along with other technical complications, 
pairwise alignment was used for all the analysis described 
in this paper unless specifically noted. For further infor- 
mation on the alignment please see the documentation in 
the GraphPAC package available on Bioconductor. Pro- 
tein/structure/chain combinations that resulted in only 
one mutation or no mutations on the residues for which 
tertiary information was available were dropped. Similar 
to iPAC, a successful alignment of the tertiary and muta- 
tional data was obtained for 140 proteins correspond- 
ing to 1100 unique structure/chain combinations. See 
"Structure Files" in Additional file 2 for a full listing and 
description. 

Traveling salesman approach 

Since the NMC algorithm requires order statistics to iden- 
tify clustering (see Section 'NMC'), we need to map the 
protein from a three dimensional to a one dimensional 
space so that order statistics may be constructed. Con- 
trary to iPAC, which employed MDS, a graph theoretic 
approach is used by GraphPAC As discussed above, one 
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major limitation of MDS is that the minimization of the 
stress function: 



cyclic permutation n of {1, 2, 3, . . . , n} that minimizes the 
total tour distance, namely: 



01 



E, V 4(X) 



(1) min 7 d(i, 7i(i)) 

71 L ' 



results in every residue having an effect on the final posi- 
tion of every other residue. In Equation 1, 8y represents 
the Euclidean distance between residues i and j in the orig- 
inal higher-dimensional space while ^y(X) represents the 
distance between them in the lower dimensional space 
X. Lastly, /, is used to account for situations where the 
proximity measures do not come from a true metric 
space. Since in our case, 8y e R, / is the identity func- 
tion. Minimization of o\ may not capture that a protein 
is typically comprised of several domains and that only 
residues within a specific domain should influence each 
others final position in linear space (see Figure 1). 

Under the GraphPAC algorithm, we first construct a 
complete graph with each residue represented by a ver- 
tex a . We then create a linear ordering of the protein 
by finding a Hamiltonian b path through the graph. As 
the number of distinct Hamiltonian paths on a complete 
graph with N vertices is equal to a direct consid- 

eration of all possible paths is computationally unfeasible. 
Further, selective pruning of the edges based upon edge 
distance is also often impractical due to the domain struc- 
ture where many residues are close to each other. Because 
of these factors, we use a heuristic algorithm that solves 
the Traveling Salesman Problem (TSP) [27,28] to find 
a linear path that is approximate of the shortest path 
through the protein. We then use this path as a representa- 
tive reordering of the protein into one dimensional space. 
Unlike iPAC, which is based on a global remapping, this 
methodology takes into account only locally neighboring 
residues to remap the protein to one dimensional space. 

While there are many heuristic solutions for the TSP 
(see Gutin and Punnen [29]), we consider three of the 
most common insertion methods [30]: cheapest inser- 
tion, farthest insertion and nearest insertion as described 
below. Specifically, the objective of the TSP is to find a 



i=l 

Here, d(i,j) represents the distance between residues i and 
; (with d(i,i) = 0) and n(i) represents the residue that 
follows residue i on the tour. The difference between the 
three insertion methods rests on how the next residue k is 
selected for insertion. Under cheapest insertion, the next k 
to be inserted into the tour is chosen such that the increase 
in tour length is minimal. Under nearest insertion, at each 
iteration, the k that is closest to a residue already on the 
tour is selected. Finally, under farthest insertion, the k that 
is farthest away from any residue already on the tour is 
selected. 

These algorithms have different upper bounds on their 
tour lengths. For example, the farthest insertion algo- 
rithm creates tours that approach | of the shortest length 
while the nearest and cheapest insertion algorithms can 
be linked to the minimal spanning tree algorithm and thus 
have an upper bound of twice the shortest tour length 
when distances satisfy the triangular inequality [28] . Due 
to the varied nature of these methods and that there is no 
biological justification to favor one over the other, we con- 
sider all three methods when identifying clusters and then 
perform an appropriate multiple comparison adjustment 
to infer the statistical evidence of mutation clusters (see 
Section 'Multiple comparison adjustment for structures'). 

As can be seen from Figure 2, all the rearrangement 
options present a positive skew and are mostly consistent 
with each other. For the majority of the proteins, all three 
insertion approaches as well as the MDS approach result 
in little rearrangement. However, if one method results 
in radical rearrangement when the protein is mapped to 
ID space, the other methods do so as well. This makes 
selection of a specific insertion method less critical and 
for the rest of this manuscript, unless otherwise speci- 
fied, we use the insertion method with the most significant 
cluster for analysis. Please see "Distribution Summary" in 
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Figure 1 An example protein with three different domains. Under iPAC, the Domain A residues will influence the final positions of Domain C 
residues and vice versa, a result that is undesirable if the three domains are independent of each other. The residues in Domain A and Domain C will 
have no effect on each other's final position via the graph theoretic approach. 
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Protein Rearrangement Comparison 
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Figure 2 The amount of rearrangement performed under each of the three insertion methods described as well as MDS. Each column on 
the x-axis represents one of the 1 1 00 structures considered, with structures from the same protein adjacent to one another and the protein order 
determined lexicographically by protein name. The y-axis shows the Kendall Tau distance, which is equivalent to the number of swaps required to 
sort the protein back into {1 ,2,3,. . . ,..} order using bubble sort. The proteins with at least one rearrangement higher than 1 50,000 represent the DPP4, 
F5, IDE, MET, PIK3Ca, SEC23A and TF proteins, from left to right, respectively. 



Additional file 3 for a full listing of each structures 
Kendall Tau distance, protein index and a high resolution 
plot. 

Path lengths between nearby and distant residues are 
statistically different 

We employed a statistical test to verify that the TSP algo- 
rithm yields a shorter path between residues that are close 
together in 3D space versus residues that are far apart. 
First, we selected 200 random protein structures from our 
data set. For each structure we then selected 100 ran- 
dom amino acids and categorized them as "close" versus 
"far" in 3D space (see Table 1 for more information on 
the classification). For structure i, we then calculated the 
average path distance between all pairwise close residues, 
denoted c/, and the average path distance between all 
pairwise far residues, denoted^. Next, we calculated the 
average close and far path distance over all structures: 
c =_(£^0/200 and / = (£^/5)/200. Finally, as c 
and / are averages, we applied the central limit theorem 
and performed a t-test with Hq : c = f vs H a : f > c. 
This test was performed for each of the insertion methods 



described in Section 'Traveling salesman approach' and at 
various classifications of close and far. 

As shown in Table 1, the p- value is ^ 0 for each com- 
bination of distance and insertion method, allowing us to 
conclude that the average path distance between residues 
that are far apart in 3D space is larger than the average 
path distance between residues that are close together in 
3D space. 

NMC 

The NMC algorithm as described by Ye et al. [8], and 
briefly reviewed here, was used to find the mutational 
clusters once the protein was remapped to ID space. To 
begin, suppose we had m samples of a protein that was N 
residues long and that there were a total of n mutations 
over all m proteins. As shown in Figure 3, by collapsing 
over the m samples, we can construct order statistics for 
every mutation. Then, given order statistics Xg^ and 
where i < k, we define a cluster to exist if Pr(Q/ = 
X(k) — X(i)) < a, for some predetermined significance 
level a. As shown in Ye et al. [8], while a closed form calcu- 
lation of the above probability is possible, it often becomes 



Table 1 This table shows the p values when testing the difference in path length between residues that are close versus 
far apart 



Close vs. Far 




Method 




Residue distance 


GraphPAC- Cheapest 


GraphPAC- Farthest 


GraphPAC- Nearest 


< 5 A vs > 25A 


9.50E-143 


2.10E-141 


1.38 E-145 


< 1 0 A vs > 30A 


9.32 E-147 


5.82 E-145 


2.81 E-149 


< 15 Avs> 35A 


2.91 E-151 


1.01 E-148 


1.33 E-153 


< 20 A vs > 40A 


1.42 E-155 


1.63 E-151 


7.15 E-158 



The left column shows the requirement to label two residues as close or far apart. For instance, "< 5A versus > 25A" signifies that residues that are less than < 5A 
apart are labeled as close while residues that are > 25A apart are considered far apart. The p-values for each method are shown in columns 2-4. 
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Figure 3 An example constructing order statistics over 3 
samples with 7 total mutations. The number inside the box 
indicates the residue number. A "*" above a residue signifies a 
non-synonymous missense substitution mutation for that residue. 
Figure from Ryslik etol. [9]. 



computationally costly. To overcome this, we calculate ^ 
and assume that the statistic is uniform on (0, 1). Then in 
the limit, it can be shown that: 



N N ~ 



~ J 0 (k-i-l)l(i + n-k)l 
= Pr(Beta(k -i,i + n- k+ 1) < c) 



(2) 



The above calculation is then performed on all pair- 
wise mutations and an appropriate multiple comparison 
adjustment is then applied. For the remainder of this 
study, we use the more conservative Bonferroni correction 
[31,32] to adjust for the intra-protein cluster p-values. See 
Section 'Multiple comparison adjustment for structures' 
for a description of how we account for the inter-protein 
multiple comparisons. Lastly, it is important to mention 
that the structural information obtained for each protein 
does not always contain the (x, y, z) coordinates for every 
residue in the protein. In such cases, in order to com- 
pare GraphPAC, iPAC and NMC on an equal basis, these 
missing residues are removed from the protein. 

We also note that since we obtained our mutational 
data from COSMIC, some tissue types are more repre- 
sented than others in the database. However, this scenario 
results in our analysis being more conservative and our 
findings even more significant. Assuming that mutations 
occur in different parts of the protein for different tissue 
types, when collapsing over all tissues a larger value of n is 
obtained while the values of i and k (as seen in Equation 2) 
for two specific mutations are not changed. This results in 
a larger p-value signifying that clusters found when col- 
lapsing over tissue types would be even more significant if 
only a unique tissue type was analyzed. 

Multiple comparison adjustment for structures 

In addition to the Bonferroni adjustment performed to 
account for multiple testing within a specific structure, 



we perform a second multiple comparison adjustment to 
account for testing all 1,100 structures. Since a single pro- 
tein can have many structures that are similar to each 
other, a second Bonferroni adjustment is too conservative 
and an integrated Bonferroni-FDR approach was per- 
formed. Specifically, for a given protein, the Bonferrroni 
adjusted p-value of each cluster was multiplied by n ^ n ~^ 
to calculate p* . Thus, p* could be compared directly to 
an a -level of 0.05 in order to determine the cluster's sig- 
nificance. Next, a rFDR[33] approach, which is a good 
approximation for the standard FDR method when there 
are a large number of independent or positively corre- 
lated tests, was used. Under this method, the expected 
value of a is estimated over all k tests and then used as 
the significance threshold. Setting k as the total number 
of structures under all three insertion methods, the mean 
alpha can be approximated by: 



rFDR 



where k = 3 x 1100 = 3300. Using a = 0.05, rFDR 
is calculated to be ^ 0.025007. Rounding down, all the 
clusters for which p* < 0.025 were deemed to be sig- 
nificant. To avoid confusion in the rest of the paper, we 
only report the p-value (with the exception of Table 2). 
However, each cluster discussed in Section 'Results' is sig- 
nificant after the Bonferroni-FDR multiple comparison 
adjustment described here. 

Results 

In this section we compare the results between Graph- 
PAC, iPAC and NMC in terms of the number of 
structures found (Section 'Method comparison') and 
describe the new proteins identified by GraphPAC 
(Section 'GraphPAC identifies novel proteins with signifi- 
cant clustering'). We also show the results of our method 
in comparison to two machine learning methods along 
with a descriptions of whether our results overlap biolog- 
ically relevant structures (Section 'Cluster localization in 
relevant sites and performance evaluation'). 

Method comparison 

Using the GraphPAC algorithm, out of the 140 proteins 
analyzed, 9, 10 and 12 proteins with statistically significant 
clusters were found under the cheapest, nearest and far- 
thest insertion methods, respectively. This corresponded 
to 223, 225 and 226 significant structures (out of the 1100 
total structures considered) under the three methods. It 
is important to note that failure to utilize the tertiary 
information results in either an over or an underestima- 
tion of the number of clusters in approximately 70% of 
the structures analyzed (see Figure 4). Hence, failure to 
account for the protein structure provides either an overly 
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Table 2 A comparison of the 15 proteins that were found to contain significant clustering via GraphPAC, iPAC or NMC 



GraphPAC iPAC NMC 



Protein 


p-value 


P* 


p-value 


P* 


p-value 


P* 


KRAS 


4.21 E-233 


4.33 E-229 


6.17E-185 


6.35 E-181 


4.39 E-233 


4.52 E-229 


TP53 


4.05 E-152 


4.48 E-147 


5.23 E-128 


6.11 E-123 


4.37 E-086 


5.30E-081 


BRAF 


3.84 E-130 


1.04 E-126 


3.73 E-130 


1.01 E-126 


3.84 E-130 


1.04 E-126 


PIK3CA 


8.20 E-084 


3.58 E-080 


8.20 E-084 


3.58 E-080 


8.20 E-084 


3.58 E-080 


NRAS 


8.26 E-029 


9.91 E-027 


5.38 E-026 


6.46 E-024 


8.26 E-029 


9.91 E-027 


HRAS 


1.54 E-014 


6.94 E-013 


1.23 E-010 


5.54 E-009 


5.61 E-010 


8.42 E-009 


AHT1 


2.47 E-005 


2.47 E-004 


1.18 E-005 


7.08 E-005 


2.47 E-005 


7.41 E-005 


IDE 


1 .56 E-003 


4.67 E-003 


2.20 E-005 


6.60 E-005 


1 .56 E-003 


4.67 E-003 


EGFR 


9.04 E-004 


9.04 E-003 


1 .35 E-004 


1 .35 E-003 






DPP4 


3.1 7 E-003 


3.63 E-002 










MAP2K4 


1.21 E-003 


1.21E-002 










NRP1 


1.58E-002 


1.58E-002 










PCSK9 


5.61 E-003 


1 .68E-002 










HA01 






7.95 E-003 


2.39 E-002 






EIF2AK2 






2.45 E-003 


7.36 E-003 







If a specific method did not find a particular protein to contain significant clustering, a "-" is shown. The p* calculation is described in Section 'Multiple comparison 
adjustment for structures'. The smallest p-value from all of the insertion methods was selected. 



complicated or overly simplified view of the mutational 
orientation. 

On the protein level, as shown in Table 2, eight proteins 
were identified as having significant clusters by Graph- 
PAC, NMC and iPAC while 7 proteins were identified as 
having significant clusters by only a subset of these meth- 
ods. We note that of these seven proteins, four of them 
were only identified via the GraphPAC methodology while 



two of them were identified only via the iPAC methodol- 
ogy. We further note that GraphPAC identifies the largest 
number of proteins with significant clustering at the same 
false discovery rate, thereby showing an increased power 
to detect mutational clustering. We also observe that 
there were no proteins found to have significant cluster- 
ing under the linear NMC algorithm that were subse- 
quently missed by the GraphPAC algorithm. See Section 



GraphPAC/Cheapest vs. NMC 

I Cheapest < NMC 
" Cheapest > NMC 
Cheapest = NMC = 1 
i Cheapest = NMO 1 



or d [jri r mv_/ v»ncd} 

m 



GraphPAC/Farthest vs. NMC 



GraphPAC/Nearest vs. NMC 



45 



60 



i Nearest < NMC 



I Nearest > NMC 



Nearest = NMC = 1 



I Nearest = NMC > 1 



i Farthest < NMC 
i Farthest > NMC 
Farthest = NMC = 1 
i Farthest = NMO 1 



GraphPAC/iPAC vs. NMC 



I iPAC < NMC 



liPAC>NMC 



iPAC=NMC = l 



UPAC=NMC>1 



Figure 4 A comparison of GraphPAC, iPAC and NMC overall the structures that were found to be significant. Each of the 3D methods are 
considered: all three GraphPAC insertion methods and iPAC. The size of each colored block represents the number of structures with the relationship 
described. For instance, out of the 223 structures with significant clusters found under the cheapest insertion method of GraphPAC (top left), 94 
structures had more clusters identified under the GraphPAC approach as compared to the NMC approach. Green is used to designate structures 
where the 3D and NMC methods identified 1 cluster while purple is used to designate structures where the 3D and NMC methods identified more 
than 1 cluster. 
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'Cluster localization in relevant sites and performance 
evaluation' for a summary of cluster overlap with active 
biological sites along with a performance evaluation via 
machine learning methods. 

GraphPAC identifies novel proteins with significant 
clustering 

GraphPAC identified four proteins with clustering that are 
missed by the iPAC algorithm: DPP4, MAP2K4, NRP1, 
and PSCK9. DPP4 is a serine protease that can modify 
tumor cell behavior and is a potential cancer therapeu- 
tic target [34]. Both MAP2K4 and NRP1 are well known 
to be associated with lung cancer [35,36]. Finally, while 
PCSK9 mutations are well known in causing hypercholes- 
terolemia [37], recent research shows that absence of 
PCSK9 can provide a protective benefit against melanoma 
due to lower circulating LDLc. This allows for a poten- 
tial additional cancer therapy via PCSK9 inhibitors [38]. 
[38]. For a full listing of which structure-protein combi- 
nations were found significant, see "Results Summary" in 
Additional file 4. Please see Sections 'GraphPAC finds 
novel proteins compared to iPAC and NMC, 'GraphPAC 
identifies additional clusters compared to iPAC and NMC 
and 'GraphPAC finds fewer clusters compared to NMC 
for an in-depth review of selected protein-structure com- 
binations. 

Cluster localization in relevant sites and performance 
evaluation 

We note that 9 of the 13 proteins that GraphPAC iden- 
tified as having significant clustering have their most 
significant cluster overlap a binding site, catalytic domain 
or kinase domain. Out of the remaining four proteins, 
three proteins have their most significant cluster fall 
within a previously identified biologically relevant region. 
For instance, IDE's most significant cluster is located on 
residues 684-698, a denaturation-resistant epitope region 
[39]. For NRP1, which plays roles in angiogenesis [40] and 
axon guidance [41], the most significant cluster directly 
overlaps the F5/8 type C 1 domain - a domain in many 
blood coagulation factors. Finally, for PIK3C-a, the most 
significant cluster overlaps residue 1047 which has been 
shown to potentially increase the substrate turnover rate, 
a common oncogenic behavior [42]. For further detail on 
relevant biological site information, please see "Relevant 
Sites" in Additional file 5. 

Further, we evaluated the performance of GraphPAC via 
two well-known machine learning algorithms: CHASM 
and PolyPhen-2. It is critical to first note however, that 
the machine learning algorithms utilize a much more 
detailed set of features when evaluating the mutation. 
Thus these algorithms may identify mutations as sig- 
nificant while GraphPAC would not. Nevertheless, of 
all the mutations that fall within significant clusters 



identified by GraphPAC, 93% and 91% of them were also 
identified as significant (FDR < 20%) by CHASM and 
PolyPhen-2 (respectively). We note that GraphPAC is only 
able to determine statistically significant clustering and 
not whether a mutation is truly damaging and/or acti- 
vating. However, given the high percentages described 
above, the evidence supports the hypothesis that cluster- 
ing is in fact indicative of potential driver mutations. Thus, 
via GraphPAC, the researcher has a fast and easily avail- 
able tool to identify potential driver mutations for further 
study. The benefit of GraphPAC is that it is able to be 
executed with far less prior information as compared to 
the machine learning approaches. For further details, see 
"Performance Evaluation" in Additional file 6. 

Finally, we note that while GraphPAC provides an 
improvement in cluster identification compared to prior 
work, the algorithm is unable to distinguish between 
mutations that increase or decrease kinase activity 
nor between gain-of-function (GOF) or loss-of-function 
(LOF) mutations. As described by Lapenna and Giordano 
[43], Brognard et al. [44], Geiger et al. [45], Ahn et al. 
[36], Lisabeth etal. [46] and Linka et al. [47], a large body 
of literature suggests that inactivating loss-of-function 
mutations are more common than previously thought and 
often occur in regions that regulate kinase activation. 
Nevertheless, as described above, many of the clusters 
identified by GraphPAC contain mutations that are clas- 
sified as driver and/or damaging by common machine 
learning algorithms. As such, GraphPAC provides a fast 
and easy method to identify such potential mutations, 
which can then be verified and analyzed via additional 
approaches. These approaches can range from the afore- 
mentioned machine learning algorithms to experimental 
approaches that test for GOF mutations as described by 
Fawdar etal. [48]. 

Discussion 

In this section we discuss in depth some of the clus- 
tering results presented in Section 'Results'. Specifically, 
we review in detail three situations: 1) GraphPAC iden- 
tifies novel proteins (Section 'GraphPAC finds novel pro- 
teins compared to iPAC and NMC), 2) GraphPAC finds 
additional clusters in proteins identified to contain clus- 
tering by other methods (Section 'GraphPAC identifies 
additional clusters compared to iPAC and NMC) and 3) 
GraphPAC finds fewer clusters compared to other meth- 
ods (Section 'GraphPAC finds fewer clusters compared to 
NMC). In each of these sections, we discuss the biological 
relevance of our findings. 

GraphPAC finds novel proteins compared to iPAC and NMC 

As shown in Table 2, GraphPAC identified five additional 
proteins as compared to the linear NMC algorithm. In this 
section we will consider two of these proteins, both of 
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which are directly related to cancer: EGFR, which is also 
identified by iPAC, and NRP1, which is not identified by 
iPAC 

EGFR is a cell-surface receptor for ligands in the epi- 
dermal growth factor family [49] and is present in a wide 
range of diseases such as glioblastoma multiforme [50], 
lung adenocarcinoma [51] and colorectal cancer [52]. The 
most significant cluster found was in the 2ITX struc- 
ture [53] between residues 719-768 (see Figure 5) with 
a corresponding p-value of 0.0009. This cluster contains 
mutations G719S, T751I and S768I which are all found 
in non-small cell lung carcinomas (NSCLC) [54-56] with 
mutation G719S well known for increased kinase activity 
[57] . It is also interesting to note that all three mutations 
within this cluster, which was identified purely through 
statistical clustering analysis, show a beneficial clinical 
response to either Erlotinib or Getfinib [55,58,59]. Exclu- 
sion of the tertiary information would have resulted in this 
cluster being missed. Finally, it is worth noting that recent 
research has shown that signaling by EGFR is depen- 
dent upon an allosteric interaction between two kinase 
domains in an asymmetric dimer as opposed to phospho- 
rylation. As the formation of this asymmetric dimer is 
believed to activate all EGFR family members, it is likely 
that oncogenic activation of EGFR may differ from other 
protein kinases [60,61]. 

We now consider the NRP-1 protein, a coreceptor for 
the vascular endothelial growth factor (VEGF) which 
is upregulated in a large variety of cancers including 
lung tumors [35], gastrointestinal metasteses [62] and 
pancreatic carcinomas [63]. In NSCLC patients, it has 
been shown to be an independent predictor of cancer 
relapse and reduced survival as well as a cancer invasion 
enhancer [64]. Moreover, research has shown that NRP-1 




Figure 5 The EGFR ectodomain fragment structure (PDB ID 2ITX) 
where the 719-768 cluster is colored in blue. The three mutations, 
71 9, 75 1 and 768 are displayed as purple spheres. 



inhibitors provide an additive effect to anti-VEGF therapy 
in reducing tumor progression. Monoclonal antibodies 
that attach to the bl-b2 domains, the domains responsi- 
ble for VEGF binding, have already been created [65]. The 
bl domain, which spans residues 275-424 almost exactly 
overlaps the most significant cluster found by GraphPAC, 
which consists of residues 277-432 (p-value 0.0158) in the 
2QQI [66] structure (Figure 6). Finally, it is worth not- 
ing that mutations on residues 297 and 320 were recently 
found that completely disrupt VEGF binding, both of 
which also fall within the GraphPAC identified cluster of 
277-432 in the 2QQI structure. 

GraphPAC identifies additional clusters compared to iPAC 
and NMC 

A representative example where GraphPAC identifies 
additional clusters as compared to NMC and iPAC is in 
the KRAS protein for the 3GFT structure 0 [67] (Figure 7). 
KRAS, a GTPase, is one of the most pervasively activated 
oncogenes, with some estimates stating that between 17- 
25% of all human tumors contain an activating mutation 
of the gene [68] . Due to the large number of samples with 
mutations in this gene and the resulting strong statisti- 
cal signal, GraphPAC, iPAC and NMC all identify that 
KRAS contains highly statistically significant mutational 
clusters. Nevertheless, GraphPAC identifies several novel 
clusters that are missed by iPAC and NMC. While all three 
methods identify clustering at residues 12-13, 12-61 and 
12-146, only iPAC and GraphPAC identify two additional 
clusters at 1) 61-117 and 2) 117-146. 

Moreover, only GraphPAC (under the cheapest and 
nearest insertion methods) identifies a statistically sig- 
nificant cluster for residues 12-23 and 23-61 as shown 
in Table 3. Considering the 12-23 cluster, we see that 
a sub-cluster of 12-13 is identified as well. This follows 
biological function as mutations on residues 12 and 13 
appear in a large variety of cancers, such as breast, lung, 




Figure 6 The NRP-1 structure (PDB ID 2QQI) where the 277-432 
cluster is colored in red. The mutations that disrupt VEGF binding, 
297 and 320 are shown as orange spheres while the end-points of the 
cluster, 277 and 432, are shown as purple spheres. 
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Figure 7 The KRAS structure (PDB ID 3GFT) color coded by 
region: amino acids 13-22 are blue, 24-60 are red and 62-145 
are yellow. Residues 1 2 and 1 3 which make up the most significant 
cluster are shown as purple spheres, while residues 23, 61 , 1 1 7 and 
146 are shown as brown spheres. 



bladder, pancreas and colon [6,69,70] while mutations on 
residues 22 and 23 appeared in colorectal/large intestine 
tissue samples in our data. It is interesting to note that 
germline mutations on residue 22 often result in develop- 
mental disorders such as Noonan Syndrome Type 3 (NS3) 
as well as Cardiofaciocutaneous Syndrome (CFC) [71,72]. 

Finally, the majority of mutations in cluster 61-146 also 
segregate along pathological lines with all the mutations 
in our data either occurring in lung or gastrointestinal 
tract/large intestine carcinomas. Specifically, residue 61 
mutations are typically found in colorectal and lung can- 
cer [6,73] while mutations K117N and A146T are found in 
colorectal cancer [6]. 

Table 3 P-value comparison of the three algorithms for 
several significant clusters 



Residues NMC iPAC GraphPAC 



12-13 


9.45 E-229 


3.91 E-165 


8.95 E-229 


12-23 






1.31 E-99 


12-61 


4.34 E-65 


2.38E E-87 


5.49 E-164 


12-146 


3.85 E-13 


3.81 E-90 


2.87 E-16 


23-61 






1.01 E-105 


61-146 




3.01 E-106 


4.35 E-31 


117-146 




1.66 E-102 





A "-" signifies that the method did not find that cluster to be significant. For 
GraphPAC, the cheapest insertion results are reported here. 



GraphPAC finds fewer clusters compared to NMC 

As seen from Figure 4, between 25%-40% of the structures 
identified with significant clustering had fewer clusters 
under the GraphPAC methodology as compared to the 
linear NMC algorithm with the vast majority of these 
structures corresponding to BRAF, HRAS and TP53. Here 
we consider a representative example, the 4E26 structure 
[74] for BRAF when analyzed using the farthest insertion 
method (Figure 8). As iPAC identified even more clus- 
ters than NMC, we compare GraphPAC to NMC in this 
section when showing that fewer mutational clusters is of 
benefit. Further, as V600 is well known to be the most 
likely mutated position in BRAF, the most significant clus- 
ter identified by GraphPAC, iPAC and NMC is located 
only on that residue with a p-value of 2.12 x 10 -129 under 
all three methods. In all, GraphPAC identifies 16 clusters 
while NMC identifies 22, with the differences shown in 
Table 4. 

Although it is outside the scope of this manuscript to 
consider every difference between Tables 4a and 4b, we 
observe that three of the longest clusters 464-671, 466- 
671 and 469-671 are dropped by GraphPAC. Since after 
alignment of the protein structural data to the mutational 
data (see Section 'Obtaining the 3D structural data'), ter- 
tiary information was available on residues 448-603 and 
610-723, these clusters cover 77.0%, 76.3% and 75.2% 
of all the available residues, respectively. By considering 
the 3D structure via GraphPAC, the longest clusters are 
dropped and the remaining overlapping clusters focus 
almost exclusively on residues 464-600. 

After structure and mutation alignment, the residue 
substitutions in significant clusters include: G464V, 
G466V, G469V, G469A, N581S, G596R, L597V, LV597R, 




Figure 8 The BRAF structure (PDB ID 4E26) color coded by 
segment: I) amino acids 464-599 are orange 2) amino acids 
601-671 are green. The a-carbons of the mutated residues 464,466, 
469, 581 , 596, 597 , 601 and 671 are shown as purple spheres. Residue 
600 is shown as a red sphere. 
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Table 4 A comparison of GraphPAC and NMC identified 
clusters for the BRAF structure 



(a) Clusters found by GraphPAC 








p-value 


Start 


End 


# Muts. 


GraphPAC 


NMC 


600 


600 


60 


2.12E-129 


2.12E-129 


597 


600 


62 


1.49E-104 


1.49E-104 


600 


601 


62 


1.49E-104 


9.22E-117 


596 


600 


64 


7.16E-102 


7.16E-102 


596 


601 


66 


3.37E-91 


1.16E-100 


597 


601 


64 


8.07E-91 


7.16E-102 


601 


671 


3 


5.85E-38 




600 


671 


63 


8.30E-37 


7.08E-26 


469 


601 


72 


2.59E-22 


5.92E-17 


581 


601 


68 


1.23E-21 


1.33E-65 


581 


600 


66 


2.94E-20 


3.13E-63 


469 


600 


70 


3.98E-20 


4.91 E-1 5 


466 


601 


74 


2.15E-17 


9.69E-19 


466 


600 


72 


7.01 E-16 


1.60E-16 


464 


601 


75 


1.15E-15 


1.12E-19 


464 


600 


73 


2.33E-14 


2.97E-17 


(b) Clusters found by NMC and dropped by GraphPAC 


Start 


End 


# Muts. 




NMC Pvalue 


596 


671 


67 




4.12E-29 


597 


671 


65 




4.79E-27 


581 


671 


69 




3.33E-26 


464 


671 


76 




5.92E-09 


466 


671 


75 




3.32E-08 


469 


671 


73 




8.11E-07 



A "-" for the NMC value signifies that cluster was not identified under the linear 
algorithm. 



V600E, V600K, K601N and R671Q. Since mutation 
R671Q does not have extensive literature and comes from 
a non-specified tissue sample in the COSMIC database, 
it will no longer be considered here. Thus, by consid- 
ering the tertiary structure, we significantly narrow the 
window of which residues to consider for potential driver 
mutations and can partition the protein into three seg- 
ments: I) 464-599 and II) 600 and III) 601. Segment I is 
primarily associated with lung and colorectal cancer as 
shown in [3,75-77]. Segment II represents the two most 
common mutations in BRAF, V600E and V600K. Over- 
all, 95% of BRAF mutations occur on V600, with some 
studies showing that V600E occurs within 73% to 79% of 
patients while V600K occurs within 12% to 19% of patients 
[78,79]. Mutations at this position result in the oncogene 
being constitutively activated with increased kinase activ- 
ity and have been found in a wide range of cancers such 



as metastatic melanoma [80], ovarian serous carcinoma 
[81] and hairy cell leukemia [82]. Furthermore, recent 
inhibitors, such as Vemurafenib and GSK2 118436 specif- 
ically target the V600E and V600E/K mutations (respec- 
tively), supporting the hypothesis that somatic clusters 
can provide pharmacological targets [83]. Lastly, segment 
III is comprised of the much less common K601N muta- 
tion which has been observed in myeloma cases along 
with V600E. Since these patients share the more common 
BRAF mutations as well, they may also potentially benefit 
from BRAF inhibitors [84]. 

Further, as shown in Section 'Results' and described 
above, GraphPAC finds fewer clusters for a significant 
percentage of the structures analyzed. Overall, the reduc- 
tion in total clusters identified can result from two 
sources: the removal of some residues because no tertiary 
data was available or the cluster is no longer significant 
when using the traveling salesman algorithm to account 
for 3D structure. The first case, which is already rare, will 
become increasingly more so as additional studies result 
in more complete and detailed structural information. For 
the second case, if a cluster is not found to be signifi- 
cant under GraphPAC when compared to NMC, a near or 
overlapping cluster is usually found (see Tables 4a and 4b). 
For BRAF specifically, under every type of graph insertion 
method (cheapest, nearest and farthest), every "probably 
damaging" or "possibly damaging" mutation (as classified 
by PolyPhen-2) was still identified in at least one signifi- 
cant cluster for the structure. For a complete analysis, see 
"Potential Driver Loss" in Additional file 7. 

Conclusion 

In this manuscript we provide an alternative method 
to utilize protein tertiary structure when identifying 
somatic mutation clusters. By employing a graph theoretic 
approach to restructuring the protein order, we identify 
both new clusters in proteins previously shown to have 
clustering as well as proteins that were not previously 
shown to have clustering. We have also provided several 
examples where we are able to identify clusters of muta- 
tions that may benefit from pharmacological treatment. 
Moreover, as GraphPAC uses the NMC algorithm to iden- 
tify clusters rather than a fixed window size, we are able to 
detect clusters of varying lengths. Finally, the methodol- 
ogy is fast and robust with the overwhelming majority of 
structure/protein combinations taking under 10 minutes 
each to analyze on a consumer desktop. 

The GraphPAC algorithm, while presenting a viable 
alternative to the MDS restriction of iPAC and an 
improvement over NMC, nevertheless contains several 
limitations. First, while no longer bound to the MDS 
requirement of iPAC, there is no closed form solution 
to the shortest path problem and our algorithm must 
appeal to heuristic approximations. Second, to satisfy the 
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uniformity assumption, the mutation status of all residues 
must be known ahead of time. With the growth of high- 
throughput sequencing however, this issue is temporary. 
Next, unequal rates of mutagenesis along with hyper- 
mutability of specific genomic regions may violate the 
assumption that every residue has a uniform probability 
of mutation. To help ensure that this assumption holds, 
we only consider single residue missense substitutions and 
have removed insertions and deletions from the analy- 
sis since they tend to be sequence dependent. Further, 
research has shown that CpG dinucleotides may have a 
mutational frequency ten times or higher compared to 
other dinucleotides [85]. However, in the analyses pre- 
sented in Sections 'GraphPAC finds novel proteins com- 
pared to iPAC and NMC, 'GraphPAC identifies additional 
clusters compared to iPAC and NMC, 'GraphPAC finds 
fewer clusters compared to NMC, only approximately 
13% of the mutations used to identify clustering occurred 
in CpG sites. Relatedly, colorectal carcinomas [86] con- 
tain more transition mutations while cigarette use results 
in more transversion mutations in lung carcinomas [8]. 
Still, when considering KRAS, the overwhelming majority 
of substitutions occur on residues 12, 13, and 61 for both 
colorectal and lung cancer, implying that while the muta- 
tional landscape may vary, it does not have a significant 
effect on mutation location and thus would not violate the 
uniformity assumption. Hence, while this analysis is influ- 
enced by a variety of factors, as are previous studies, it 
nevertheless appears that the primary cause of clustering 
is selection for a cancer phenotype. 

Several areas for future research are also directly evi- 
dent. First, an approach that considers the protein directly 
in 3D space via simulation may be employed. However, 
such an approach would not be able to use the order statis- 
tic methodology to identify clustering and thus might not 
be as sensitive for small mutation counts. Moreover, while 
we only consider distance when finding the shortest path 
through the graph, future research can incorporate the 
physico-chemical properties of the specific residues or 
domains by appropriately increasing or decreasing edge 
length. The potential additive effect of multiple cancer 
mutations in the same protein, as discussed in the case of 
EGFR by Hashimoto et al [87], can also be incorporated 
via additional refinement of the edge weights. Additional 
research is required in this area in order to incorporate 
these improvements. 

Overall however, GraphPAC utilizes protein tertiary 
structure via a graph theoretic approach in identify- 
ing mutational clustering. We show that this method 
identifies new clusters that are otherwise missed and 
that in some cases, pharmaceutical targets for muta- 
tions in these clusters have already been found and 
therapies created. Specifically, Erlotinib and Getfinib are 
used to target mutations in EGFR significant clusters 



(see Section 'GraphPAC finds novel proteins compared 
to iPAC and NMC) while Vemurafenib is used to tar- 
get mutations that occur within BRAF significant clusters 
(see Section 'GraphPAC finds fewer clusters compared to 
NMC). This helps confirm the hypothesis that mutational 
clustering may be indicative of driver mutations and as 
new protein structures become available, GraphPAC can 
provide a rapid methodology to identify such potential 
mutations. 

Availability and requirements 

Project Name: GraphPAC: Identification of Mutational 
Clusters in Proteins via a Graph Theoretical Approach. 
Project Home Page: http://www.bioconductor.org/ 
packages/release/bioc/html/GraphPAC.html 
Operating system(s): Platform independent 
Programming Language: R 

Other Requirements: R > 2.15 (see homepage for R 
package requirements). 
License: GPL-2 

Endnotes 

a Under a complete graph, every vertex is connected to 
every other vertex. The length of the edge between 
vertices i and / is set to be equal to the length between 
amino acids i and / in M 3 . 

b A Hamiltonian path is a walk through the graph that 
visits every vertex once and only once. 

c For this analysis, a manual reconstruction was 
performed in order to include residue 61 which is listed 
as a histidine under isoform 2B in the Uniprot Database 
and a glutamine in the COSMIC database. As the 
substitution of one amino acid in the structure would not 
have a significant impact on the spatial structure of the 
protein, and residue 61 is a highly mutated position, the 
residue was kept in the analysis. As a result, amino acids 
1-167 are used. 

Additional files 



Additional file 1 : Cosmic Query. The SQL query used to extract the 
mutations from COSMIC. 

Additional file 2: Structure Files. A detailed list of which 
protein-structure combinations were used and what chains were selected. 

Additional file 3: Distribution Summary. A detailed list of Kendall's 
Distance for each structure under each of the methods. 

Additional file 4: Results Summary. A summary of each structure's most 
significant p-value for both iPAC and NMC. 

Additional file 5: Relevant Sites. A review showing which of the iPAC 
clusters fall within structurally relevant sites. 

Additional file 6: Performance Evaluation. In-depth results validating 
the iPAC results using PolyPhen-2 and CHASM. 

Additional file 7: Potential Driver Loss. An analysis of whether any 
potential driver mutations are lost when iPAC finds fewer clusters than NMC. 
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